Abstract: This paper considers the problem of communication
over a discrete memoryless channel (DMC) or an additive white
Gaussian noise (AWGN) channel subject to the constraint that the
probability that an adversary who observes the channel outputs
can detect the communication is low. Specifically, the relative
entropy between the output distributions when a codeword is
transmitted and when no input is provided to the channel must
be sufficiently small. For a DMC whose output distribution
induced by the “off” input symbol is not a mixture of the output
distributions induced by other input symbols, it is shown that the
maximum amount of information that can be transmitted under
this criterion scales like the square root of the blocklength. The
same is true for the AWGN channel. Exact expressions for the
scaling constant are also derived.

Index Terms: Low probability of detection, covert communication,
information-theoretic security, Fisher information.

I. Introduction

In many secret-communication applications, it is required
not only that the adversary should not learn the content of
the message being communicated, as in [1], but also that it
should not learn whether the legitimate parties are communicating
at all. Such problems are often referred to
as communication with low probability of detection (LPD) or
covert communication. Depending on the application, they can
be formulated in various ways.

In [2] the authors consider a wiretap channel model [3], 
and refer to this LPD requirement as stealth. They show that 
stealth can be achieved without sacrificing communication rate 
or using an additional secret key. In their scheme, when not 
sending a message, the transmitter sends some random noise 
symbols to simulate the distribution of a codeword. There are 
many scenarios, however, where this cannot be done, because 
the transmitter must be switched off when not transmitting 
a message. Indeed, the criterion is often that the adversary 
should not be able to tell whether the transmitter is on or off, 
rather than whether it is sending anything meaningful or not. 
It is the former criterion that is considered in the current paper. 

Our work is closely related to the recent works [4]-[6].
In [4] the authors consider the problem of communication
over an additive white Gaussian noise (AWGN) channel with
the requirement that a wiretapper should not be able to tell
with high confidence whether the transmitter is sending a
codeword or the all-zero sequence. It is observed that the
maximum amount of information that can be transmitted under
this requirement scales like the square root of the blocklength.¹
In [5] the authors consider a similar problem for the binary 
symmetric channel and show that the “square-root law” also 
holds. One major difference between [4] and [5] is that in 
the former the transmitter and the receiver use a secret key to 
generate their codebook, whereas in the latter no secret key 
is used. More recently, [6] studies the LPD problem from a 
resolvability perspective and improves upon [4] in terms of 
secret-key length. 

In the current paper, we show that the square-root law holds
for a broad class of discrete memoryless channels (DMCs).²
Furthermore, we provide exact characterizations for the scaling 
constant of the amount of information with respect to the 
square root of the blocklength for DMCs as well as AWGN 
channels, which is not done in [4]-[6]. 

We do not assume that the eavesdropper observes a noisier 
channel than the intended receiver; instead, we assume that 
they both observe the same channel outputs. Our reason 
for dropping the wiretap structure is that, unlike in secret 
communication where the assumption that the eavesdropper 
observes a noisier channel allows one to obtain information- 
theoretic secrecy without using a secret key, in LPD problems 
the wiretap assumption does not bring essential new insights. 
In particular, the square-root law does not rely on the wiretap
structure.³ Hence, by putting the eavesdropper in the same
position as the intended receiver, we allow ourselves to focus
on the essence of the LPD-communication problem, while at
the same time making our results more relevant in practice,
the latter because in applications the legitimate parties usually
cannot fully determine the statistical behavior of the
eavesdropper’s channel. We also note that extension of most of the
results in the paper to wiretap channels is straightforward, part
of which can be seen in [6].

¹We adopt the usual terminology and use “blocklength” to refer to the total
number of channel uses by a code. However, in the square-root case, the
channel codes are not “block codes” in the traditional sense, because they
cannot be used repeatedly. Indeed, repeated transmission would increase the
eavesdropper’s probability of detecting the communication.

²The achievability part of the square-root law, but not the converse, is
independently derived in [6].

³In fact, one can verify that the results in [4] hold without the wiretap
assumption; see Section V of the current paper for stronger results.





Because we do not assume a wiretap structure, contrary to 
[5], in our setting LPD communication is impossible without 
a secret key. We assume that such a key is available, and are 
not concerned with its length within the scope of this paper. 

We assume that the receiver does know when the transmitter 
is sending a message. This is a realistic assumption because 
the transmitter and the receiver can use part of their secret key 
to perform synchronization prior to transmission: they choose
a (large enough) number of input sequences of a certain length 
such that each sequence induces an output distribution that is 
sufficiently different from the output distribution when there 
is no input to the channel, while on average these sequences 
induce an output distribution that is sufficiently close to the 
output distribution when there is no input. Using part of the 
secret key they randomly pick one of these sequences, which 
the transmitter sends to the receiver as a synchronization signal 
before sending a message. 

One technical difference between [4], [5] and the present
work is that the earlier works use total variation distance
to measure probability of detection whereas we use relative
entropy, as in [2], [7]. Note that, when the relative entropy is
given, the total variation distance can be upper-bounded using
Pinsker’s inequality [8]. See [2] for further discussions on the
relation between relative entropy and detectability. In practice,
which of the two quantities is more relevant may depend on
the actual application,⁴ whereas for theoretical analysis relative
entropy is clearly easier to handle.
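
As a small numerical illustration of this bound (a sketch, not part of the original analysis), the following Python snippet computes the relative entropy in nats and the total variation distance for two hypothetical output distributions, and checks Pinsker's inequality in the form $\mathrm{TV}(Q, Q_0) \le \sqrt{D(Q\|Q_0)/2}$. The function names and the example distributions are illustrative assumptions.

```python
import math

def relative_entropy(q, q0):
    """D(q || q0) in nats; assumes supp(q) is contained in supp(q0)."""
    return sum(a * math.log(a / b) for a, b in zip(q, q0) if a > 0)

def total_variation(q, q0):
    """Total variation distance, i.e. half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(q, q0))

# Hypothetical output distributions (not taken from the paper).
q0 = [0.9, 0.1]    # output distribution when the transmitter is off
q = [0.85, 0.15]   # output distribution when it is on

d = relative_entropy(q, q0)
tv = total_variation(q, q0)
pinsker_bound = math.sqrt(d / 2)
print(f"D = {d:.5f} nats, TV = {tv:.5f}, Pinsker bound = {pinsker_bound:.5f}")
```

A small relative entropy therefore forces a small total variation distance, which is why a constraint on the former also limits the latter.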

Summarizing the above discussions, we now briefly describe 
our setting: 

• We consider a DMC whose input alphabet contains an 
“off” symbol. When the transmitter is switched off, it 
always sends this symbol. 

• The transmitter and the receiver share a secret key that 
is sufficiently long. 

• We assume that the adversary observes the same channel 
outputs as the intended receiver, i.e., there is no wiretap 
structure. 

• The LPD criterion is that the relative entropy between 
the output distributions when a codeword is transmitted 
and when the all-zero sequence is transmitted must be 
sufficiently small. 

The square-root law has been observed in various scenarios
in steganography [9]-[11]. The setup in steganography that
is most related to our work is as follows: a data file called
the cover text is generated according to some distribution,
and a message must be concealed in this file subject to the
constraint that the file should look almost unchanged. This
is similar to the LPD setting in the sense that, when no
message is to be conveyed, the encoder should not do anything;
hence, in steganography the output is the original data file,
whereas in LPD communications the output is pure noise.
But steganography and LPD communications are essentially
different: in steganography the data file is generated first and
shown to the encoder, whereas in LPD communications noise
is added to the codeword after the latter is chosen by the
encoder. Hence the two types of problems require different
analyses.

⁴The total variation distance would be the right quantity to look at if one
assumes equal probabilities for the transmitter sending and not sending a
message, because it would correspond to the minimum probability of detection
error by the eavesdropper. However, such an assumption is clearly unrealistic
in practice.

The rest of this paper is arranged as follows. In Section II 
we formulate the problem for DMCs and briefly analyze 
the case where the “off” input symbol induces an output 
distribution that can be written as a mixture of the other output 
distributions; the next two sections focus on the case where 
it cannot. In Section III we derive formulas for characterizing 
the maximum amount of information that can be transmitted 
over any DMC under the LPD constraint. In Section IV we 
derive a simpler formula that is applicable to some DMCs. 
In Section V we formulate and solve the problem for AWGN 
channels. Finally, in Section VI we conclude the paper with 
some remarks on future directions. 

II. Problem Formulation for DMCs

Consider a DMC with finite input and output alphabets $\mathcal{X}$
and $\mathcal{Y}$, and with transition law $W(\cdot|\cdot)$. Throughout this paper,
we use the letter $P$ to denote input distributions on $\mathcal{X}$ and the
letter $Q$ to denote output distributions on $\mathcal{Y}$. Let $0 \in \mathcal{X}$ be the
“off” input symbol; i.e., when the transmitter is not sending a
message, it always transmits 0. Denote

$$Q_0(\cdot) \triangleq W(\cdot|0). \tag{1}$$

Without loss of generality, we assume that no two input
symbols induce the same output distribution; in particular,
$W(\cdot|x) = Q_0(\cdot)$ implies $x = 0$.

A (deterministic) code of blocklength $n$ for message set $\mathcal{M}$
consists of an encoder $\mathcal{M} \to \mathcal{X}^n$, $m \mapsto x^n$, and a decoder
$\mathcal{Y}^n \to \mathcal{M}$, $y^n \mapsto \hat{m}$. The transmitter and the receiver choose
a random code of blocklength $n$ for message set $\mathcal{M}$ using
a secret key shared between them. The adversary is assumed
to know the distribution according to which the transmitter
and the receiver choose the random code, but not their actual
choice.⁵

The random code, together with a message $M$ uniformly
drawn from $\mathcal{M}$, induces a distribution $Q^n(\cdot)$ on $\mathcal{Y}^n$. We
require that, for some constant $\delta > 0$,⁶

$$D\left(Q^n \,\middle\|\, Q_0^{\times n}\right) \le \delta. \tag{2}$$

Here $Q_0^{\times n}$ denotes the $n$-fold product distribution of $Q_0$,
i.e., the output distribution over $n$ channel uses when the
transmitter is off.

At this point, we observe that an input symbol $x$ with
$\mathrm{supp}(W(\cdot|x)) \not\subseteq \mathrm{supp}(Q_0)$, where $\mathrm{supp}(\cdot)$ denotes the support
of a distribution, should never be used by the transmitter.
Indeed, using such an input symbol with nonzero probability
would result in $D(Q^n \| Q_0^{\times n})$ being infinity. Hence we can
drop all such input symbols, as well as all output symbols that
do not lie in $\mathrm{supp}(Q_0)$, reducing the channel to one where

$$\mathrm{supp}(Q_0) = \mathcal{Y}. \tag{3}$$

⁵Note that we assume that the eavesdropper observes the same channel
outputs as the intended receiver, so LPD communication is impossible with
deterministic codes.

⁶All logarithms in this paper are natural. Accordingly, information is
measured in nats.

Throughout this paper we assume that (3) is satisfied. Note
that, for channels that cannot be reduced to one that satisfies
(3), such as the binary erasure channel, nontrivial LPD
communication is not possible.

Our goal is to find the maximum possible value for $\log|\mathcal{M}|$
for which a random codebook of length $n$ exists that satisfies
condition (2), and whose average probability of error is at
most $\epsilon$. (Later we shall require that $\epsilon$ be arbitrarily small.) We
denote this maximum value by $K_n(\delta, \epsilon)$.

We call an input symbol $x$ redundant if $W(\cdot|x)$ can be
written as a mixture of the other output distributions, i.e., if

$$W(\cdot|x) \in \mathrm{conv}\{W(\cdot|x') : x' \in \mathcal{X},\ x' \neq x\}, \tag{4}$$

where conv denotes the convex hull. As we shall show,
$K_n(\delta, \epsilon)$ can increase either linearly with the blocklength $n$
or like $\sqrt{n}$, depending on whether 0 is redundant or not.

A. Case 1: input symbol 0 is redundant

This is the case where there exists some distribution $P$ on
$\mathcal{X}$ such that

$$P(0) = 0 \tag{5a}$$

$$\sum_{x\in\mathcal{X}} P(x)\, W(\cdot|x) = Q_0(\cdot). \tag{5b}$$

In this case, a positive communication rate can be achieved:

Proposition 1. If input symbol 0 is redundant, then for any
$\delta > 0$,

$$\lim_{\epsilon\downarrow 0} \lim_{n\to\infty} \frac{K_n(\delta,\epsilon)}{n} = \max I(P, W), \tag{6}$$

where the maximum is taken over input distributions $P$ that
satisfy (5).

Proof: First note that a random codebook generated IID
according to $P$ that satisfies (5) yields $D(Q^n\|Q_0^{\times n}) = 0$. By
the standard typicality argument [12], when the rate of the code
is below $I(P, W)$, the probability of a decoding error can be
made arbitrarily small as $n$ goes to infinity. Conversely, for a
codebook whose empirical input distribution does not satisfy
(5b), $D(Q^n\|Q_0^{\times n})$ grows linearly in $n$ and is hence unbounded
as $n$ goes to infinity. Finally, we check that any $P$ that does
not satisfy (5a) is suboptimal. Indeed, for any (nontrivial) $P$
that satisfies (5b) but not (5a), let $P'$ be $P$ conditional on
$\mathcal{X}\setminus\{0\}$, then $P'$ also satisfies (5b) and $I(P', W) > I(P, W)$.

■

Example 1. Binary symmetric channel with an additional
“off” symbol.

Consider a binary symmetric channel with an additional
“off” symbol as shown in Fig. 1. Its optimal input distribution
for LPD communication is uniform on $\{-1, 1\}$, and its capacity
under the LPD constraint (2) is the same as its capacity
without this constraint, and equals $1 - H_b(p)$, where $H_b(\cdot)$ is
the binary entropy function.
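
The claims in Example 1 can be checked numerically. The sketch below assumes a hypothetical cross-over probability $p = 0.1$ and uses an illustrative helper `mutual_information`; it verifies that the uniform input on $\{-1, 1\}$ satisfies (5) and that the resulting rate matches the usual BSC capacity. Because the code works in nats, $1 - H_b(p)$ bits appears as $\log 2 - H_b(p)$ with $H_b$ in nats.

```python
import math

def mutual_information(p_in, w):
    """I(P, W) in nats for input distribution p_in and channel rows w[x]."""
    n_out = len(w[0])
    q = [sum(p_in[x] * w[x][y] for x in range(len(w))) for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_in):
        for y in range(n_out):
            if px > 0 and w[x][y] > 0:
                total += px * w[x][y] * math.log(w[x][y] / q[y])
    return total

p = 0.1  # hypothetical cross-over probability
# Rows: input -1, input +1, and the "off" input 0 (uniform output).
w = [[1 - p, p], [p, 1 - p], [0.5, 0.5]]
q0 = w[2]

p_in = [0.5, 0.5, 0.0]  # uniform on {-1, 1}; never uses the "off" symbol
q = [sum(p_in[x] * w[x][y] for x in range(3)) for y in range(2)]

h_b = -p * math.log(p) - (1 - p) * math.log(1 - p)  # binary entropy in nats
rate = mutual_information(p_in, w)
print(q, rate, math.log(2) - h_b)
```

Since the induced output distribution coincides with $Q_0$, the relative entropy in (2) is zero for every blocklength, so the LPD constraint costs nothing here.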


Fig. 1. A binary symmetric channel on the alphabet $\{-1, 1\}$ with cross-over
probability $p$, with an additional “off” input symbol 0 which induces a
uniform output distribution.

Fig. 2. The binary symmetric channel with cross-over probability $p$.

B. Case 2: input symbol 0 is not redundant

This is the case where no $P$ satisfying (5) can be found. It
is the focus of the next two sections. A simple example for
this case is the binary symmetric channel in Fig. 2.

We shall show that, in this case, $K_n$ grows like $\sqrt{n}$. Let

$$L \triangleq \lim_{\epsilon\downarrow 0} \liminf_{n\to\infty} \frac{K_n(\delta,\epsilon)}{\sqrt{n\delta}}, \tag{7}$$

where $\liminf$ denotes the limit inferior. Note that both $K_n(\delta,\epsilon)$
and $\delta$ have unit nat, so $L$ has unit $\sqrt{\text{nat}}$. We shall characterize
$L$ in the next two sections. Note that, by definition, $L$ can be
infinity, as it is in Case 1.

At this point, we provide some intuition why positive
communication rates cannot be achieved in this case. To
achieve a positive rate, a necessary condition is that a
nonvanishing proportion of input symbols used in the codebook
should be different from the “off” symbol 0. This would
mean that the average marginal distribution $P$ on $\mathcal{X}$ has a
positive probability at values other than 0 and, since $Q_0$ cannot
be written as a mixture of output distributions produced by
nonzero input symbols, the average output distribution $Q$ must
be different from $Q_0$, so $D(Q\|Q_0) > 0$. This implies that
$D(Q^n\|Q_0^{\times n})$ must grow without bound as $n$ tends to infinity,
violating the LPD constraint (2).

III. General Expressions for L for All DMCs

In this section we derive computable expressions for L. 
Our focus is on Case 2 where 0 is not redundant, though 
some results also hold (in a trivial way) in Case 1 where 0 is 
redundant. We first prove the following natural but nontrivial 
single-letter formula. 

Theorem 1. For any DMC,

$$L = \max_{\{P_n\}} \liminf_{n\to\infty} \sqrt{\frac{n}{\delta}}\, I(P_n, W), \tag{8}$$

where the maximum is taken over sequences of joint distributions
on $\mathcal{X}\times\mathcal{Y}$ induced by input distributions $P_n$ and channel
$W$, whose marginals $Q_n$ on $\mathcal{Y}$ satisfy

$$D(Q_n\|Q_0) \le \frac{\delta}{n}. \tag{9}$$

Remark: Although the proof below does not guarantee that
the limit inferior in (8) can be replaced by the limit, this is
indeed the case, as we show at the end of this section.

Proof of Theorem 1: Proposition 1 shows that, when
input symbol 0 is redundant, $L = \infty$. This is consistent with
Theorem 1. The rest of the proof focuses on Case 2 as in
Section II-B, where 0 is not redundant.

We first prove the converse part. This is done via Fano’s
inequality and manipulation of the information quantities.

Suppose there exists a sequence of random codes satisfying
(2), where, at blocklength $n$, the size of the codebook is
$\exp(K_n)$, and the error probability is $\epsilon_n$, which tends to zero
as $n$ tends to infinity. By a standard argument using Fano’s
inequality [13],

$$K_n(1-\epsilon_n) - 1 \le I(X^n; Y^n). \tag{10}$$

Let $P_n$ denote the average input distribution on $\mathcal{X}$, averaged
over the codebook and over the $n$ channel uses. We upper-bound
$I(X^n; Y^n)$ in the usual way:

$$\begin{aligned}
I(X^n; Y^n) &= H(Y^n) - H(Y^n|X^n) \\
&= \sum_{i=1}^n \left( H(Y_i|Y^{i-1}) - H(Y_i|X_i) \right) \\
&\le \sum_{i=1}^n \left( H(Y_i) - H(Y_i|X_i) \right) \\
&= \sum_{i=1}^n I(X_i; Y_i) \\
&\le n\, I(P_n, W),
\end{aligned} \tag{11}$$

where the last step follows because, when the channel law is
fixed, mutual information is concave in the input distribution.
Combining (7), (10), and (11) yields

$$L \le \liminf_{n\to\infty} \sqrt{\frac{n}{\delta}}\, I(P_n, W). \tag{12}$$

Next let $Q_n$ denote the average output distribution on $\mathcal{Y}$.
Clearly, $Q_n$ is the output distribution induced by $P_n$ through
$W$. Recall that $Q^n$ denotes the $n$-fold output distribution on
$\mathcal{Y}^n$. Further let $Q_{n,i}$ denote the marginal of $Q^n$ on the $i$th
output $Y_i$. Let $Y^n$ have distribution $Q^n$, then (see also [14])

$$\begin{aligned}
D\left(Q^n\,\middle\|\,Q_0^{\times n}\right) &= -H(Y^n) + \mathrm{E}_{Q^n}\!\left[\log\frac{1}{Q_0^{\times n}(Y^n)}\right] \\
&= -\sum_{i=1}^n H(Y_i|Y^{i-1}) + \sum_{i=1}^n \mathrm{E}_{Q_{n,i}}\!\left[\log\frac{1}{Q_0(Y_i)}\right] \\
&\ge -\sum_{i=1}^n H(Y_i) + \sum_{i=1}^n \mathrm{E}_{Q_{n,i}}\!\left[\log\frac{1}{Q_0(Y_i)}\right] \\
&= \sum_{i=1}^n D(Q_{n,i}\|Q_0) \\
&\ge n\, D(Q_n\|Q_0),
\end{aligned} \tag{13}$$

where the last step follows because relative entropy is convex.
This combined with (2) implies that

$$D(Q_n\|Q_0) \le \frac{\delta}{n}. \tag{14}$$

Combining (12) and (14) proves the converse part of Theorem 1.

We next prove the achievability part. To this end, we
randomly generate a codebook that satisfies (2) and then
show that, as the length of the codewords tends to infinity,
the probability of a decoding error can be made arbitrarily
small provided that the codebook has a size smaller than that
determined by the right-hand side of (8).

Let $\{P_n\}$ be a sequence of input distributions such that the
induced output distributions $\{Q_n\}$ satisfy (9). For every $n$,
we randomly generate a codebook by choosing the codewords
IID according to $P_n$. The decoder performs joint-typicality
decoding.

It is clear that the output distribution on $\mathcal{Y}^n$ for this code
is $Q^n = Q_n^{\times n}$, so that (2) is satisfied. It remains to show
that, provided that the size of the codebook is smaller than
$\exp\left(n I(P_n, W) - \sqrt{n}\,\epsilon_n\right)$ for some $\epsilon_n$ tending to zero as
$n$ tends to infinity, the probability of a decoding error can
be made arbitrarily small. This cannot be shown using the
asymptotic equipartition property [12], or the information-spectrum
method [15], [16], because we are in a situation
where communication rate is zero. However, by slightly
varying the methods in [15], [16], or using the one-shot
achievability bounds as in [17], [18], we can obtain that the
sequence $\{K_n\}$ is achievable provided

$$\limsup_{n\to\infty} \frac{K_n}{\sqrt{n}} \le \text{P-}\liminf_{n\to\infty} \frac{1}{\sqrt{n}} \log\frac{W^n(Y^n|X^n)}{Q_n^{\times n}(Y^n)}, \tag{15}$$

where P-liminf denotes the limit inferior in probability,
namely, the largest number such that the probability that the
random variable in consideration is greater than this number
tends to one as $n$ tends to infinity. Recalling (7), to prove the
achievability part of Theorem 1, it now suffices to show that
the right-hand side of (15) is lower-bounded by

$$\liminf_{n\to\infty} \sqrt{n}\, I(P_n, W).$$
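
We show a slightly stronger result which is

$$\frac{1}{\sqrt{n}} \log\frac{W^n(Y^n|X^n)}{Q_n^{\times n}(Y^n)} - \sqrt{n}\, I(P_n, W) \to 0 \quad \text{in probability} \tag{16}$$

as $n$ tends to infinity. To this end, first note

$$\mathrm{E}\!\left[\frac{1}{\sqrt{n}} \log\frac{W^n(Y^n|X^n)}{Q_n^{\times n}(Y^n)}\right] = \frac{1}{\sqrt{n}}\, I(X^n; Y^n) = \sqrt{n}\, I(P_n, W). \tag{17}$$

It then follows by Chebyshev’s inequality that, for any constant
$a > 0$,

$$\Pr\!\left[\,\left|\frac{1}{\sqrt{n}} \log\frac{W^n(Y^n|X^n)}{Q_n^{\times n}(Y^n)} - \sqrt{n}\, I(P_n, W)\right| > a\right] \le \frac{1}{a^2}\, \mathrm{var}\!\left(\frac{1}{\sqrt{n}} \log\frac{W^n(Y^n|X^n)}{Q_n^{\times n}(Y^n)}\right). \tag{18}$$

Thus, to prove (16), it suffices to show

$$\mathrm{var}\!\left(\frac{1}{\sqrt{n}} \log\frac{W^n(Y^n|X^n)}{Q_n^{\times n}(Y^n)}\right) \to 0 \tag{19}$$

as $n$ tends to infinity. To show (19), we first simplify this
variance to

$$\mathrm{var}\!\left(\frac{1}{\sqrt{n}} \log\frac{W^n(Y^n|X^n)}{Q_n^{\times n}(Y^n)}\right) = \frac{1}{n}\sum_{i=1}^n \mathrm{var}\!\left(\log\frac{W(Y_i|X_i)}{Q_n(Y_i)}\right) = \mathrm{var}\!\left(\log\frac{W(Y|X)}{Q_n(Y)}\right). \tag{20}$$

The variance on the right-hand side of (20) is upper-bounded
by the second moment:

$$\begin{aligned}
\mathrm{var}\!\left(\log\frac{W(Y|X)}{Q_n(Y)}\right) &\le \mathrm{E}_{P_n\circ W}\!\left[\left(\log\frac{W(Y|X)}{Q_n(Y)}\right)^{2}\right] \\
&= P_n(0)\, \mathrm{E}_{Q_0}\!\left[\left(\log\frac{Q_0(Y)}{Q_n(Y)}\right)^{2}\right] + \sum_{x\neq 0} P_n(x)\, \mathrm{E}_{W(\cdot|x)}\!\left[\left(\log\frac{W(Y|x)}{Q_n(Y)}\right)^{2}\right].
\end{aligned} \tag{21}$$

Here we use $P_n\circ W$ to denote the joint distribution on $\mathcal{X}\times\mathcal{Y}$
induced by input distribution $P_n$ through channel $W$. To prove
(19), it suffices to show that both terms on the right-hand side
of (21) tend to zero as $n$ tends to infinity. For the first term,
note that (9) requires that

$$Q_n \to Q_0 \tag{22}$$

as $n$ tends to infinity, so

$$\lim_{n\to\infty} \log\frac{Q_0(y)}{Q_n(y)} = 0, \quad \forall y\in\mathcal{Y}, \tag{23}$$

which further implies (recall that $|\mathcal{Y}|$ is finite so one can switch
the order of limit and expectation)

$$\lim_{n\to\infty} \mathrm{E}_{Q_0}\!\left[\left(\log\frac{Q_0(Y)}{Q_n(Y)}\right)^{2}\right] = 0. \tag{24}$$

Thus, since $P_n(0)$ is bounded between 0 and 1, the first term
on the right-hand side of (21) tends to zero as $n$ tends to
infinity. To analyze the second term on the right-hand side
of (21), recall our assumption that $Q_0$ cannot be written as a
mixture of the other output distributions. Thus, to have (22)
we need

$$\lim_{n\to\infty} P_n(0) = 1, \tag{25}$$

so

$$\lim_{n\to\infty} P_n(x) = 0, \quad \forall x \neq 0. \tag{26}$$

We next use (22) to obtain (recall again that $|\mathcal{Y}|$ is finite)

$$\lim_{n\to\infty} \mathrm{E}_{W(\cdot|x)}\!\left[\left(\log\frac{W(Y|x)}{Q_n(Y)}\right)^{2}\right] = \mathrm{E}_{W(\cdot|x)}\!\left[\left(\log\frac{W(Y|x)}{Q_0(Y)}\right)^{2}\right], \tag{27}$$

which is finite for every $x\in\mathcal{X}$, $x\neq 0$, because $Q_0(y) > 0$
for every $y\in\mathcal{Y}$; recall (3). This combined with (26) implies
that the second term on the right-hand side of (21) tends to
zero as $n$ tends to infinity.

We have now established that the right-hand side of (21)
tends to zero as $n$ tends to infinity, which further establishes
(19) and, hence, (16). This concludes the achievability part of
Theorem 1. ■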

Using Theorem 1 we derive the following computable 
expression for L. 


Theorem 2. For any DMC satisfying (3), whose “off” input
symbol 0 is not redundant, and which has at least one input
symbol other than 0,⁷ $L$ is positive and finite, and is given by

$$L = \max_{P:\, P(0)=0} \frac{\sum_{x\in\mathcal{X}} P(x)\, D(W(\cdot|x)\|Q_0)}{\sqrt{\dfrac{1}{2}\sum_{y\in\mathcal{Y}} \dfrac{(Q(y)-Q_0(y))^{2}}{Q_0(y)}}}, \tag{28}$$

where $Q$ is the output distribution induced by $P$ through $W$.


Before proving Theorem 2 we note that, for some channels, 
such as the next example, (28) is very easy to compute. 


Example 2. Binary symmetric channel.

Consider the binary symmetric channel in Fig. 2. Clearly,
the only possible choice for $P$ in (28) is $P(1) = 1$. We
thus obtain the value of $L$ as a function of $p$, which we
plot in Fig. 3. Not surprisingly, when $p$ approaches 0.5, $L$
approaches zero, as does the capacity of the channel. It is
however interesting to notice that, when $p$ approaches zero, $L$
also approaches zero, even though the capacity of the channel
approaches 1 bit per use. This is because, when $p$ is very
small, it is very easy to distinguish the two input symbols 0
and 1 at the receiver end. Hence the LPD criterion requires
that the transmitter must use 1 very sparsely, limiting the
number of information bits it can send. The maximum of $L$
is approximately $0.94\ \sqrt{\text{nat}}$, achieved at $p = 0.083$.
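
The numbers quoted in Example 2 can be reproduced with a short computation. The sketch below evaluates (28) for the binary symmetric channel with $P(1) = 1$, where $Q_0 = (1-p, p)$ and $Q = (p, 1-p)$, and compares it with the closed form $\sqrt{2p(1-p)}\,\log\frac{1-p}{p}$ that results from simplifying (28) for this channel. The function names and the grid search are illustrative choices, not part of the original text.

```python
import math

def l_from_theorem2_bsc(p):
    """Evaluate (28) for the BSC with P(1) = 1, where Q_0 = (1-p, p)."""
    q0 = [1 - p, p]
    q = [p, 1 - p]  # output distribution when input 1 is always sent
    numerator = sum(a * math.log(a / b) for a, b in zip(q, q0))
    chi2 = sum((a - b) ** 2 / b for a, b in zip(q, q0))
    return numerator / math.sqrt(0.5 * chi2)

def l_closed_form(p):
    """Closed form obtained by simplifying (28) for the BSC."""
    return math.sqrt(2 * p * (1 - p)) * math.log((1 - p) / p)

# Coarse grid search over the cross-over probability in (0, 0.5).
grid = [i / 1000 for i in range(1, 500)]
p_best = max(grid, key=l_from_theorem2_bsc)
l_best = l_from_theorem2_bsc(p_best)
print(f"max L = {l_best:.3f} at p = {p_best:.3f}")
```

The closed form also makes the two limiting behaviors visible: the factor $\sqrt{2p(1-p)}$ vanishes as $p$ approaches zero, and the logarithm vanishes as $p$ approaches 0.5.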


⁷By our assumption, this input symbol induces an output distribution that
is different from $Q_0$, so the channel is not trivial.

Fig. 3. The value of $L$ for the binary symmetric channel in Fig. 2 as a
function of $p$.


Proof of Theorem 2: For every $n$, let

$$P_n = \operatorname*{arg\,max}_{P_n'} I(P_n', W) \tag{29}$$

subject to

$$D(Q_n\|Q_0) \le \frac{\delta}{n}. \tag{30}$$

Using the same argument as for (25), we have

$$\lim_{n\to\infty} P_n(0) = 1, \tag{31}$$

hence $P_n$ can be written as

$$P_n = (1-\mu_n) P_0 + \mu_n \tilde{P}_n, \tag{32}$$

where $P_0$ is the deterministic distribution with $P_0(0) = 1$, $\tilde{P}_n$
is a distribution with $\tilde{P}_n(0) = 0$, and $\mu_n$ is positive and tends
to zero as $n$ tends to infinity. Fix $\tilde{P}_n$ and consider $P_n$ given
by (32) as a function of $\mu_n$, then

$$\left.\frac{\mathrm{d} I(P_n, W)}{\mathrm{d}\mu_n}\right|_{\mu_n=0} = \sum_{x\in\mathcal{X}} \tilde{P}_n(x)\, D(W(\cdot|x)\|Q_0), \tag{33}$$

hence

$$I(P_n, W) = \mu_n \sum_{x\in\mathcal{X}} \tilde{P}_n(x)\, D(W(\cdot|x)\|Q_0) + o(\mu_n), \tag{34}$$

where the term $o(\mu_n)$ tends to zero faster than $\mu_n$ as $n$ tends
to infinity.

The output distribution resulting from feeding $P_n$ given by
(32) into the channel $W$ is

$$Q_n = (1-\mu_n) Q_0 + \mu_n \tilde{Q}_n, \tag{35}$$

where $\tilde{Q}_n$ is the output distribution induced by input distribution
$\tilde{P}_n$ through $W$. The relative entropy $D(Q_n\|Q_0)$ is
approximated by the Fisher information [19] with respect to
parameter $\mu_n$:

$$D(Q_n\|Q_0) = \frac{\mu_n^{2}}{2} \sum_{y\in\mathcal{Y}} \frac{(\tilde{Q}_n(y)-Q_0(y))^{2}}{Q_0(y)} + o(\mu_n^{2}), \tag{36}$$

where the term $o(\mu_n^{2})$ tends to zero faster than $\mu_n^{2}$ as $n$ tends
to infinity. By (30) and (36), $\mu_n$ should have the form

$$\mu_n = \sqrt{\frac{\delta}{n}}\; \frac{1}{\sqrt{\dfrac{1}{2}\sum_{y\in\mathcal{Y}} \dfrac{(\tilde{Q}_n(y)-Q_0(y))^{2}}{Q_0(y)}}} + o(n^{-1/2}). \tag{37}$$

Plugging (37) into (34) yields

$$I(P_n, W) = \sqrt{\frac{\delta}{n}}\; \frac{\sum_{x\in\mathcal{X}} \tilde{P}_n(x)\, D(W(\cdot|x)\|Q_0)}{\sqrt{\dfrac{1}{2}\sum_{y\in\mathcal{Y}} \dfrac{(\tilde{Q}_n(y)-Q_0(y))^{2}}{Q_0(y)}}} + o(n^{-1/2}). \tag{38}$$

When $n$ tends to infinity, $I(P_n, W)$ is dominated by the first
term on the right-hand side of (38), hence $\tilde{P}_n$ should tend
to the (not necessarily unique) distribution that maximizes
this term. Recalling Theorem 1, this completes the proof of
Theorem 2. ■

From the proof of Theorem 2 it follows that the limit inferior
in (8) can be replaced by the limit, yielding a more convenient
expression for $L$:

Corollary 1. For any DMC,

$$L = \lim_{n\to\infty} \sqrt{\frac{n}{\delta}}\, \max_{P_n} I(P_n, W), \tag{39}$$

where the maxima are subject to (9).

Proof: We only need to show that the limit in (39) exists.
When input symbol 0 is redundant, this limit exists and is
infinity. When 0 is not redundant, the proof of Theorem 2
shows that this limit also exists and equals the right-hand side
of (28). ■

IV. A Simpler but Less General Expression for L

In this section we consider channels that satisfy the following
condition.

Condition 1. There exists a capacity-achieving input distribution
that uses all the input symbols.

Note that Condition 1 implies that no input symbol is
redundant; in particular, 0 is not redundant.

We next give a simple upper bound on L under Condition 1. 
Later we provide an additional condition under which this 
bound is tight. 

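Theorem 3. Consider a DMC that satisfies Condition 1.
Denote its capacity-achieving output distribution by $Q^*$, then

$$L \le \sqrt{2\, \mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right)}, \tag{40}$$

where $\mathrm{var}_{Q_0}(\cdot)$ denotes the variance of a function of $Y$ where
$Y$ has distribution $Q_0$.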

The proof of Theorem 3 utilizes the following lemma. 

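Lemma 1. Let $Q^*$ denote the capacity-achieving output distribution
for a DMC $W(\cdot|\cdot)$ of capacity $C$. Let $P'$ be any input
distribution, and let $Q'$ denote the output distribution induced
by $P'$ through $W$. Then

$$I(P', W) \le C - D(Q'\|Q^*), \tag{41}$$

where equality holds if $\mathrm{supp}(P') \subseteq \mathrm{supp}(P^*)$ for some
capacity-achieving input distribution $P^*$.

Proof: We have the following identity (see [20]):

$$\begin{aligned}
I(P', W) &= \sum_{x\in\mathcal{X}} P'(x)\, D(W(\cdot|x)\|Q') \\
&= \sum_{x\in\mathcal{X}} P'(x)\, \mathrm{E}_{W(\cdot|x)}\!\left[\log\frac{W(Y|x)}{Q'(Y)}\right] \\
&= \sum_{x\in\mathcal{X}} P'(x)\, \mathrm{E}_{W(\cdot|x)}\!\left[\log\frac{W(Y|x)}{Q^*(Y)} + \log\frac{Q^*(Y)}{Q'(Y)}\right] \\
&= \sum_{x\in\mathcal{X}} P'(x)\, \mathrm{E}_{W(\cdot|x)}\!\left[\log\frac{W(Y|x)}{Q^*(Y)}\right] - \mathrm{E}_{Q'}\!\left[\log\frac{Q'(Y)}{Q^*(Y)}\right] \\
&= \sum_{x\in\mathcal{X}} P'(x)\, D(W(\cdot|x)\|Q^*) - D(Q'\|Q^*).
\end{aligned} \tag{42}$$

By the Kuhn-Tucker conditions for channel capacity [8],

$$D(W(\cdot|x)\|Q^*) \le C, \tag{43}$$

where equality holds if $x\in\mathrm{supp}(P^*)$. We hence have

$$C = \sum_{x\in\mathcal{X}} P^*(x)\, D(W(\cdot|x)\|Q^*) \ge \sum_{x\in\mathcal{X}} P'(x)\, D(W(\cdot|x)\|Q^*), \tag{44}$$

where equality holds if $\mathrm{supp}(P') \subseteq \mathrm{supp}(P^*)$. Combining
(42) and (44) proves the lemma. ■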
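
As a quick sanity check of Lemma 1 (a numerical sketch under assumed parameters, not part of the proof), consider a binary symmetric channel with a hypothetical cross-over probability $p = 0.1$. Its capacity-achieving input distribution is uniform and hence has full support, so (41) should hold with equality for any input distribution $P'$.

```python
import math

def kl(a, b):
    """Relative entropy D(a || b) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

p = 0.1  # hypothetical cross-over probability
w = [[1 - p, p], [p, 1 - p]]
q_star = [0.5, 0.5]           # capacity-achieving output distribution of the BSC
capacity = kl(w[0], q_star)   # equals D(W(.|x) || Q*) for both inputs, cf. (43)

p_prime = [0.8, 0.2]          # arbitrary (hypothetical) input distribution P'
q_prime = [sum(p_prime[x] * w[x][y] for x in range(2)) for y in range(2)]

# Left-hand side of (41): I(P', W) written as an average of relative entropies.
mutual_info = sum(p_prime[x] * kl(w[x], q_prime) for x in range(2))
# Right-hand side of (41).
rhs = capacity - kl(q_prime, q_star)
print(mutual_info, rhs)
```

The gap $C - I(P', W)$ is thus exactly the relative entropy between the induced output distribution and $Q^*$, which is the form used in the proof of Theorem 3.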

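Proof of Theorem 3: Since the channel satisfies Condition 1,
from Lemma 1 and Corollary 1 we have

$$L = \lim_{n\to\infty} \sqrt{\frac{n}{\delta}}\, \left(C - \min D(Q_n\|Q^*)\right), \tag{45}$$

where the minimum is over $Q_n \in \mathrm{conv}\{W(\cdot|x) : x\in\mathcal{X}\}$
satisfying (9). To determine $L$, we need to find $Q_n$ that minimizes
$D(Q_n\|Q_0)$ for a fixed $D(Q_n\|Q^*)$. To find an upper bound
on $L$, we drop the condition $Q_n \in \mathrm{conv}\{W(\cdot|x) : x\in\mathcal{X}\}$
to consider all distributions on $\mathcal{Y}$. Then the minimum is well
known to be achieved by a distribution from the exponential
family connecting $Q_0$ and $Q^*$ [21]:

$$Q_n(y) = \frac{Q_0(y)^{1-\lambda_n}\, Q^*(y)^{\lambda_n}}{\sum_{y'\in\mathcal{Y}} Q_0(y')^{1-\lambda_n}\, Q^*(y')^{\lambda_n}}, \quad y\in\mathcal{Y}, \tag{46}$$

for some $\lambda_n \in [0,1]$. Indeed, if a distribution $Q_n$ minimizes
$D(Q_n\|Q^*)$ for some fixed $D(Q_n\|Q_0)$, then it must minimize

$$(1-\lambda_n)\, D(Q_n\|Q_0) + \lambda_n\, D(Q_n\|Q^*)$$

for some $\lambda_n \in [0,1]$. This sum can be written as

$$(1-\lambda_n)\, D(Q_n\|Q_0) + \lambda_n\, D(Q_n\|Q^*) = D(Q_n\|R_n) - \log \sum_{y'\in\mathcal{Y}} Q_0(y')^{1-\lambda_n}\, Q^*(y')^{\lambda_n}, \tag{47}$$

where

$$R_n(y) = \frac{Q_0(y)^{1-\lambda_n}\, Q^*(y)^{\lambda_n}}{\sum_{y'\in\mathcal{Y}} Q_0(y')^{1-\lambda_n}\, Q^*(y')^{\lambda_n}}, \quad y\in\mathcal{Y}. \tag{48}$$

Hence the best choice is $Q_n = R_n$.

It remains to compute $D(Q_n\|Q_0)$ and $D(Q_n\|Q^*)$, where
$Q_n$ is of the form (46), for large $n$. When $n$ is large, $Q_n$ must
be close to $Q_0$ and hence $\lambda_n$ must be close to zero. In this case,
$D(Q_n\|Q_0)$ is approximated by the Fisher information [19]
with respect to parameter $\lambda_n$:

$$D(Q_n\|Q_0) = \frac{\lambda_n^{2}}{2}\, \mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right) + o(\lambda_n^{2}). \tag{49}$$

This together with the requirement that $Q_n$ must satisfy (9)
implies that

$$\lambda_n \le \sqrt{\frac{2\delta}{n\, \mathrm{var}_{Q_0}\!\left(\log\dfrac{Q_0(Y)}{Q^*(Y)}\right)}} + o(n^{-1/2}). \tag{50}$$

Next we compute the derivative of $D(Q_n\|Q^*)$, with $Q_n$ given
in (46), with respect to $\lambda_n$ evaluated at $\lambda_n = 0$ to be

$$\left.\frac{\mathrm{d} D(Q_n\|Q^*)}{\mathrm{d}\lambda_n}\right|_{\lambda_n=0} = -\mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right). \tag{51}$$

By Condition 1, there exists a capacity-achieving input distribution
that uses 0, so

$$\lim_{\lambda_n\downarrow 0} D(Q_n\|Q^*) = D(Q_0\|Q^*) = C. \tag{52}$$

Hence

$$C - D(R_n\|Q^*) = \lambda_n\, \mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right) + o(\lambda_n). \tag{53}$$

Combining (45), (50), and (53) proves (40). ■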

The bound (40) is tight for many channels, e.g., the binary
symmetric channel of Example 2. We next provide a sufficient
condition for (40) to be tight.

Let $s$ be the $|\mathcal{Y}|$-dimensional vector given by

$$s(y) = Q_0(y) \left(\log\frac{Q^*(y)}{Q_0(y)} + C\right), \quad y\in\mathcal{Y}. \tag{54}$$

Consider the following system of linear equations with unknowns
$a_x$, $x\in\mathcal{X}\setminus\{0\}$:

$$\sum_{x\in\mathcal{X}\setminus\{0\}} a_x \left(W(\cdot|x) - Q_0\right) = s. \tag{55}$$

Solving (55) is a simple problem in linear algebra.

Theorem 4. Suppose Condition 1 is satisfied. If (55) has a
nonnegative solution, then (40) holds with equality:

$$L = \sqrt{2\, \mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right)}. \tag{56}$$

The intuition behind Theorem 4 is the following: the vector
$s$ represents the tangent of the curve $Q_n(y)$ given by (46) as a
function of $\lambda_n$ at $\lambda_n = 0$. That (55) has a nonnegative solution
means that $s$ lies in the convex cone generated by
$\{W(\cdot|x) - Q_0 : x\in\mathcal{X}\setminus\{0\}\}$. This further implies that, for small enough
$\lambda_n$, $Q_n$ of the form given by (46) is a valid output distribution,
which, as can be seen in the proof of Theorem 3, guarantees
(40) to hold with equality. Along a different direction, we
provide below a proof utilizing Theorem 2.
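
Proof of Theorem 4: We use Theorem 2 to prove
Theorem 4. Let $\{a_x : x\in\mathcal{X}\setminus\{0\}\}$ be a nonnegative solution
to (55), and let

$$a \triangleq \sum_{x\in\mathcal{X}\setminus\{0\}} a_x. \tag{57}$$

Then the following constitutes a valid choice for $P$ in (28):

$$P(x) = \frac{a_x}{a}, \quad x\in\mathcal{X}\setminus\{0\}. \tag{58}$$

The corresponding $Q$ is given by

$$\begin{aligned}
Q &= \sum_{x\in\mathcal{X}\setminus\{0\}} \frac{a_x}{a}\, W(\cdot|x) \\
&= Q_0 + \frac{1}{a} \sum_{x\in\mathcal{X}\setminus\{0\}} a_x \left(W(\cdot|x) - Q_0\right) \\
&= Q_0 + \frac{s}{a}.
\end{aligned} \tag{59}$$

We evaluate (28) for this choice of $P$ to obtain a lower bound
on $L$. We first compute the denominator, using (59):

$$\sqrt{\frac{1}{2}\sum_{y\in\mathcal{Y}} \frac{(Q(y)-Q_0(y))^{2}}{Q_0(y)}} = \frac{1}{a}\sqrt{\frac{1}{2}\sum_{y\in\mathcal{Y}} \frac{s(y)^{2}}{Q_0(y)}} = \frac{1}{a}\sqrt{\frac{1}{2}\, \mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right)}. \tag{60}$$

We next compute the numerator:

$$\begin{aligned}
\sum_{x\in\mathcal{X}\setminus\{0\}} P(x)\, D(W(\cdot|x)\|Q_0)
&= \sum_{x\in\mathcal{X}\setminus\{0\}} P(x) \sum_{y\in\mathcal{Y}} W(y|x) \log\frac{W(y|x)}{Q_0(y)} \\
&= \sum_{x\in\mathcal{X}\setminus\{0\}} P(x) \sum_{y\in\mathcal{Y}} W(y|x) \log\frac{W(y|x)}{Q^*(y)} + \sum_{x\in\mathcal{X}\setminus\{0\}} P(x) \sum_{y\in\mathcal{Y}} W(y|x) \log\frac{Q^*(y)}{Q_0(y)} \\
&= \sum_{x\in\mathcal{X}\setminus\{0\}} P(x)\cdot C + \frac{1}{a} \sum_{x\in\mathcal{X}\setminus\{0\}} \sum_{y\in\mathcal{Y}} a_x W(y|x) \log\frac{Q^*(y)}{Q_0(y)} \\
&= C + \frac{1}{a} \sum_{y\in\mathcal{Y}} \log\frac{Q^*(y)}{Q_0(y)} \sum_{x\in\mathcal{X}\setminus\{0\}} a_x W(y|x) \\
&= C + \frac{1}{a} \sum_{y\in\mathcal{Y}} \log\frac{Q^*(y)}{Q_0(y)} \left(a\, Q_0(y) + s(y)\right) \qquad (61) \\
&= C - D(Q_0\|Q^*) + \frac{1}{a} \sum_{y\in\mathcal{Y}} s(y) \log\frac{Q^*(y)}{Q_0(y)} \\
&= C - C + \frac{1}{a} \sum_{y\in\mathcal{Y}} Q_0(y) \log\frac{Q^*(y)}{Q_0(y)} \left(\log\frac{Q^*(y)}{Q_0(y)} + C\right) \\
&= \frac{1}{a}\, \mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right), \qquad (62)
\end{aligned}$$

where (61) follows from (55). Combining Theorem 2, (60),
and (62) yields

$$L \ge \sqrt{2\, \mathrm{var}_{Q_0}\!\left(\log\frac{Q_0(Y)}{Q^*(Y)}\right)}. \tag{63}$$

Recalling Theorem 3, both (40) and (63) must hold with
equality. ■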

Example 3. A k-ary uniform-error channel.

Consider a channel with $\mathcal{X} = \mathcal{Y} = \{0, 1, \ldots, k-1\}$ and

$$W(y|x) = \begin{cases} 1 - p, & y = x \\ \dfrac{p}{k-1}, & y \neq x \end{cases} \tag{64}$$

where $p \in (0,1)$. Clearly, its capacity-achieving output distribution
$Q^*$ is uniform. It is easy to check that (55) has solution

$$a_x = \frac{p(1-p)\bigl(\log((k-1)(1-p)) - \log p\bigr)}{(k-1)(1-p) - p}, \quad x \in \mathcal{X} \setminus \{0\} \tag{65}$$

which is nonnegative. We can hence use Theorem 4 to obtain

$$L = \sqrt{2\,\nu(k,p)} \tag{66}$$

where

$$\nu(k,p) = (1-p) \left( \log \frac{1}{1-p} \right)^2 + p \left( \log \frac{k-1}{p} \right)^2
- \left( (1-p) \log \frac{1}{1-p} + p \log \frac{k-1}{p} \right)^2. \tag{67}$$
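As a numerical sanity check on Example 3, the following Python sketch evaluates (65), (66), and (67) in nats. The function names are illustrative and not part of the paper. The helper `variance_form` rests on an assumption: it reads $\nu(k,p)$ as the variance under $Q_0$ of $\log(Q^*(Y)/Q_0(Y))$ with $Q^*$ uniform, which is how the example appears to connect to Theorem 4.

```python
import math


def nu(k: int, p: float) -> float:
    """Evaluate nu(k, p) from (67), using natural logarithms."""
    a = math.log(1.0 / (1.0 - p))
    b = math.log((k - 1) / p)
    mean = (1.0 - p) * a + p * b
    return (1.0 - p) * a ** 2 + p * b ** 2 - mean ** 2


def a_x(k: int, p: float) -> float:
    """Evaluate the solution (65); undefined when (k-1)(1-p) == p."""
    num = p * (1.0 - p) * (math.log((k - 1) * (1.0 - p)) - math.log(p))
    return num / ((k - 1) * (1.0 - p) - p)


def scaling_constant(k: int, p: float) -> float:
    """L from (66)."""
    return math.sqrt(2.0 * nu(k, p))


def variance_form(k: int, p: float) -> float:
    """Assumed reading: var under Q_0 of log(Q*(Y)/Q_0(Y)), Q* uniform."""
    q0 = [1.0 - p] + [p / (k - 1)] * (k - 1)
    llr = [math.log((1.0 / k) / q) for q in q0]
    mean = sum(q * v for q, v in zip(q0, llr))
    return sum(q * (v - mean) ** 2 for q, v in zip(q0, llr))
```

For $k = 2$ the channel is a binary symmetric channel and (67) reduces to $p(1-p)\log^2((1-p)/p)$, which gives a quick way to check the implementation.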

While one might speculate that (56) holds, for example, for 
all symmetric channels, this is, perhaps surprisingly, not the 
case. The following example demonstrates this. 

Example 4. A ternary symmetric channel.

Consider a ternary symmetric channel where $\mathcal{X} = \mathcal{Y} = \{0,1,2\}$ and

$$W(\cdot|0) = [0.37 \quad 0.01 \quad 0.62] \tag{68a}$$

$$W(\cdot|1) = [0.62 \quad 0.37 \quad 0.01] \tag{68b}$$

$$W(\cdot|2) = [0.01 \quad 0.62 \quad 0.37]. \tag{68c}$$

The right-hand side of (56) yields 0.66 for this channel, but
one can check that, in fact, $L = 0.62$. This is because, as Fig. 4
shows, the exponential family connecting $Q_0$ and $Q^*$ in the
neighborhood of $Q_0$ does not lie in the set of possible output
distributions $\operatorname{conv}\{W(\cdot|x) : x \in \mathcal{X}\}$, or, roughly equivalently,
$s$ does not lie in the convex cone generated by
$\{W(\cdot|x) - Q_0 : x \in \mathcal{X} \setminus \{0\}\}$.
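The two numbers quoted in Example 4 can be reproduced with a short Python sketch. It rests on two assumptions that are not restated in this section: that the right-hand side of (56) is $\sqrt{2\operatorname{var}_{Q_0}(\log(Q^*(Y)/Q_0(Y)))}$, and that the objective in (28) is the ratio of $\sum_x P(x) D(W(\cdot|x)\|Q_0)$ to $\sqrt{\tfrac{1}{2}\sum_y (Q(y)-Q_0(y))^2/Q_0(y)}$ with $P(0) = 0$. The grid search over $P$ and all function names are illustrative only.

```python
import math

# Rows are W(.|x) for x = 0, 1, 2, taken from (68a)-(68c).
W = [
    [0.37, 0.01, 0.62],
    [0.62, 0.37, 0.01],
    [0.01, 0.62, 0.37],
]
Q0 = W[0]
# The rows are cyclic shifts of one another, so Q* is taken to be uniform.
Q_STAR = [1.0 / 3.0] * 3


def rhs_of_56() -> float:
    """sqrt(2 * var_{Q_0}(log Q*(Y)/Q_0(Y))), the assumed form of (56)."""
    llr = [math.log(qs / q0) for qs, q0 in zip(Q_STAR, Q0)]
    mean = sum(q0 * v for q0, v in zip(Q0, llr))
    var = sum(q0 * (v - mean) ** 2 for q0, v in zip(Q0, llr))
    return math.sqrt(2.0 * var)


def kl(p, q) -> float:
    """Relative entropy D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective_28(t: float) -> float:
    """Assumed objective of (28) for P(1) = t, P(2) = 1 - t."""
    q = [t * W[1][y] + (1.0 - t) * W[2][y] for y in range(3)]
    num = t * kl(W[1], Q0) + (1.0 - t) * kl(W[2], Q0)
    chi2 = sum((q[y] - Q0[y]) ** 2 / Q0[y] for y in range(3))
    return num / math.sqrt(0.5 * chi2)


def grid_max(steps: int = 1000) -> float:
    """Maximize the assumed objective over a grid of t in [0, 1]."""
    return max(objective_28(i / steps) for i in range(steps + 1))
```

Under these assumptions `rhs_of_56()` rounds to 0.66 and `grid_max()` rounds to 0.62, with the maximum attained at the corner $P(1) = 1$, in line with the gap described in the example.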


Fig. 4. The ternary symmetric channel in Example 4. The black triangle
depicts the set of possible output distributions. The blue curves are the
exponential families connecting the conditional output distributions and the
capacity-achieving output distribution $Q^*$. The exponential family connecting
$Q_0$ and $Q^*$ (like the other two exponential families) has a part that lies outside
the black triangle, which is why (56) does not hold for this channel.


V. AWGN Channels

Consider an AWGN channel described by

$$Y = X + Z, \tag{69}$$

where $X \in \mathbb{R}$ is the channel input, $Y \in \mathbb{R}$ is the channel
output, and $Z \in \mathbb{R}$ has the zero-mean Gaussian distribution
of variance $\sigma^2$, denoted $\mathcal{N}(0, \sigma^2)$, and is independent of $X$.
Let the “off” input symbol be 0, so $Q_0$ is also $\mathcal{N}(0, \sigma^2)$. The
encoder and decoder generate a random code as in Section II
subject to the LPD constraint (2), and $L$ is again defined as in
(7). Note that we do not impose any average- or peak-power
constraint on the input, but imposing such constraints will not
affect the value of $L$ due to the stronger LPD constraint (2).*

* The LPD constraint requires that the average input power tend to zero as
$n$ tends to infinity, hence rendering any additional average-power constraint
inactive. As for peak-power constraints, our choice of input distribution to
achieve $L$ is zero-mean Gaussian with vanishing variance. The influence of
cutting the tail of such a distribution to meet any peak-power constraint will
vanish as $n$ tends to infinity.

Theorem 5. For an AWGN channel,

$$L = 1 \ \sqrt{\text{nat}} \tag{70}$$

irrespective of the noise power $\sigma^2$.

The proof of Theorem 5 is divided into the converse part
and the achievability part, and is given below.

A. Converse for Theorem 5

Examining the proof of Theorem 1, we see that its converse
part is valid for the AWGN channel. Hence

$$L \le \max_{\{P_n\}} \lim_{n \to \infty} \sqrt{\frac{n}{\delta}} \, I(P_n, W), \tag{71}$$

where the maximum is taken over sequences of joint distributions
on $(X, Y) \in \mathbb{R} \times \mathbb{R}$ induced by input distribution $P_n$ via
the channel law $W$ resulting from the relation (69), such that
the marginal distributions $Q_n$ for $Y$ satisfy

$$D(Q_n \| Q_0) \le \frac{\delta}{n}. \tag{72}$$

Let the second moment of the distribution $P_n$ be denoted
$\rho_n$. It is well known that the zero-mean Gaussian maximizes
$I(P_n, W)$ among all distributions of the same second moment
(see, e.g., [13]), so

$$I(P_n, W) \le \frac{1}{2} \log \left( 1 + \frac{\rho_n}{\sigma^2} \right). \tag{73}$$

Because $X$ and $Z$ are independent, the second moment of the
distribution $Q_n$ is $\rho_n + \sigma^2$, yielding

\begin{align}
D(Q_n \| Q_0) &= -h(Q_n) + \mathrm{E}_{Q_n}\!\left[ \log \frac{1}{Q_0(Y)} \right] \notag \\
&= -h(Q_n) + \mathrm{E}_{Q_n}\!\left[ \log \left( \sqrt{2\pi\sigma^2} \, \exp\!\left( \frac{Y^2}{2\sigma^2} \right) \right) \right] \notag \\
&= -h(Q_n) + \frac{1}{2} \log (2\pi\sigma^2) + \mathrm{E}_{Q_n}\!\left[ \frac{Y^2}{2\sigma^2} \right] \notag \\
&= -h(Q_n) + \frac{1}{2} \log (2\pi\sigma^2) + \frac{\rho_n + \sigma^2}{2\sigma^2} \notag \\
&\ge -\frac{1}{2} \log \bigl( 2\pi e (\rho_n + \sigma^2) \bigr) + \frac{1}{2} \log (2\pi\sigma^2) + \frac{\rho_n + \sigma^2}{2\sigma^2} \notag \\
&= \frac{\rho_n}{2\sigma^2} - \frac{1}{2} \log \left( 1 + \frac{\rho_n}{\sigma^2} \right), \tag{74}
\end{align}

where $h(\cdot)$ denotes the differential entropy, and where the
inequality follows because the zero-mean Gaussian distribution
maximizes differential entropy among all distributions
of the same second moment. It follows from (74) that, for
$D(Q_n \| Q_0)$ to approach zero as $n$ tends to infinity, $\rho_n$ must
tend to zero and

$$D(Q_n \| Q_0) \ge \frac{\rho_n^2}{4\sigma^4} + o(\rho_n^2). \tag{75}$$

Combined with (72), this implies

$$\rho_n \le 2\sigma^2 \sqrt{\frac{\delta}{n}} + o(n^{-1/2}). \tag{76}$$

Plugging this into (73) we obtain

\begin{align}
I(P_n, W) &\le \frac{1}{2} \log \left( 1 + \frac{\rho_n}{\sigma^2} \right) \notag \\
&\le \frac{\rho_n}{2\sigma^2} \notag \\
&\le \sqrt{\frac{\delta}{n}} + o(n^{-1/2}). \tag{77}
\end{align}

Combining (71) and (77) yields

$$L \le 1. \tag{78}$$

This concludes the proof of the converse part of Theorem 5. 
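The chain (73)-(78) can be illustrated numerically. The Python sketch below is only a sanity check under the formulas as stated above: it finds, by bisection, the largest second moment $\rho_n$ for which the lower bound in (74) does not exceed $\delta/n$, and then normalizes the Gaussian bound (73) by $\sqrt{\delta/n}$. The function names and the bisection routine are illustrative choices, not part of the proof.

```python
import math


def kl_lower_bound(rho: float, sigma2: float) -> float:
    """Right-hand side of (74): rho/(2 sigma^2) - (1/2) log(1 + rho/sigma^2)."""
    x = rho / sigma2
    return 0.5 * (x - math.log1p(x))


def max_second_moment(n: float, delta: float, sigma2: float) -> float:
    """Largest rho with kl_lower_bound(rho) <= delta / n, found by bisection."""
    target = delta / n
    lo, hi = 0.0, sigma2
    while kl_lower_bound(hi, sigma2) <= target:
        hi *= 2.0  # the bound is increasing in rho, so this terminates
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kl_lower_bound(mid, sigma2) <= target:
            lo = mid
        else:
            hi = mid
    return lo


def normalized_rate_bound(n: float, delta: float, sigma2: float) -> float:
    """sqrt(n / delta) times the bound (73), evaluated at the largest allowed rho."""
    rho = max_second_moment(n, delta, sigma2)
    return math.sqrt(n / delta) * 0.5 * math.log1p(rho / sigma2)
```

For large $n$ the value returned by `normalized_rate_bound` approaches 1 regardless of `sigma2`, since only the ratio $\rho_n/\sigma^2$ enters both (73) and (74).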

B. Achievability for Theorem 5 

The achievability proof of Theorem 1 relies on the finiteness 
of the input and output alphabets, therefore it is not applicable 
to the AWGN channel. Indeed, Theorem 1 may not hold 
for a general continuous-alphabet channel. However, for the 
AWGN channel, we only need to prove an achievability result 
for Gaussian input distributions, which is much simpler than 
proving it for arbitrary input distributions. 
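For blocklength $n$, we randomly generate a codebook such
that every codeword is independent of every other codeword,
and is IID $\mathcal{N}(0, \rho_n)$ with

$$\rho_n = 2\sigma^2 \sqrt{\frac{\delta}{n}}. \tag{79}$$

We first check that the LPD condition is met. Indeed, the
output sequence is IID $\mathcal{N}(0, \rho_n + \sigma^2)$, so

\begin{align}
D\bigl( Q^n \,\|\, Q_0^{\times n} \bigr) &= n D\bigl( \mathcal{N}(0, \rho_n + \sigma^2) \,\|\, \mathcal{N}(0, \sigma^2) \bigr) \notag \\
&= n \left( \frac{\rho_n}{2\sigma^2} - \frac{1}{2} \log \frac{\rho_n + \sigma^2}{\sigma^2} \right) \notag \\
&\le \frac{n}{2} \cdot \frac{\rho_n^2}{2\sigma^4} \notag \\
&= \delta, \tag{80}
\end{align}

where for the inequality we use the fact

$$\log(1 + a) \ge a - \frac{a^2}{2}, \quad a \ge 0. \tag{81}$$

We next look at the maximum number of nats that can be
reliably transmitted with this code. Similar to the DMC case,
we can show that the sequence $\{K_n\}$ is achievable if (15)
holds, except that now $Q_n$ and $W$ are density and conditional
density, respectively. The ratio between $W^n$ and $Q_n^{\times n}$ in (15)
can be evaluated as

$$\frac{W^n(y^n | x^n)}{Q_n^{\times n}(y^n)} = \prod_{i=1}^{n} \sqrt{\frac{\rho_n + \sigma^2}{\sigma^2}} \, \exp\!\left( \frac{y_i^2}{2(\rho_n + \sigma^2)} - \frac{(y_i - x_i)^2}{2\sigma^2} \right). \tag{82}$$

Hence

$$\frac{1}{\sqrt{n}} \log \frac{W^n(Y^n | X^n)}{Q_n^{\times n}(Y^n)} = \frac{\sqrt{n}}{2} \log \left( 1 + \frac{\rho_n}{\sigma^2} \right) + \frac{1}{\sqrt{n}} \left( \frac{\sum_{i=1}^{n} Y_i^2}{2(\rho_n + \sigma^2)} - \frac{\sum_{i=1}^{n} Z_i^2}{2\sigma^2} \right). \tag{83}$$

The mean of (83) satisfies

\begin{align}
\mathrm{E}\!\left[ \frac{1}{\sqrt{n}} \log \frac{W^n(Y^n | X^n)}{Q_n^{\times n}(Y^n)} \right]
&= \frac{\sqrt{n}}{2} \log \left( 1 + \frac{\rho_n}{\sigma^2} \right) + \frac{1}{\sqrt{n}} \left( \frac{\sum_{i=1}^{n} \mathrm{E}[Y_i^2]}{2(\rho_n + \sigma^2)} - \frac{\sum_{i=1}^{n} \mathrm{E}[Z_i^2]}{2\sigma^2} \right) \notag \\
&= \frac{\sqrt{n}}{2} \log \left( 1 + \frac{\rho_n}{\sigma^2} \right) \notag \\
&\ge \frac{\sqrt{n}}{2} \left( \frac{\rho_n}{\sigma^2} - \frac{\rho_n^2}{2\sigma^4} \right) \notag \\
&= \sqrt{\delta} - \frac{\delta}{\sqrt{n}}, \tag{84}
\end{align}

where we again use (81). By (84) we know that

$$\lim_{n \to \infty} \mathrm{E}\!\left[ \frac{1}{\sqrt{n}} \log \frac{W^n(Y^n | X^n)}{Q_n^{\times n}(Y^n)} \right] \ge \sqrt{\delta}. \tag{85}$$

It remains to show that

$$\lim_{n \to \infty} \operatorname{var}\!\left( \frac{1}{\sqrt{n}} \log \frac{W^n(Y^n | X^n)}{Q_n^{\times n}(Y^n)} \right) = 0. \tag{86}$$

Then, by Chebyshev’s inequality, we can establish that, for every $\epsilon > 0$,

$$\lim_{n \to \infty} \Pr\!\left[ \frac{1}{\sqrt{n}} \log \frac{W^n(Y^n | X^n)}{Q_n^{\times n}(Y^n)} \ge \sqrt{\delta} - \epsilon \right] = 1, \tag{87}$$

and hence

$$L \ge 1. \tag{88}$$

Using (83), the variance in (86) can be computed as:

\begin{align}
\operatorname{var}\!\left( \frac{1}{\sqrt{n}} \log \frac{W^n(Y^n | X^n)}{Q_n^{\times n}(Y^n)} \right)
&= \frac{1}{n} \operatorname{var}\!\left( \frac{\sum_{i=1}^{n} Y_i^2}{2(\rho_n + \sigma^2)} - \frac{\sum_{i=1}^{n} Z_i^2}{2\sigma^2} \right) \notag \\
&= \frac{1}{n} \sum_{i=1}^{n} \operatorname{var}\!\left( \frac{Y_i^2}{2(\rho_n + \sigma^2)} - \frac{Z_i^2}{2\sigma^2} \right) \notag \\
&= \operatorname{var}\!\left( \frac{Y^2}{2(\rho_n + \sigma^2)} - \frac{Z^2}{2\sigma^2} \right) \notag \\
&= \mathrm{E}\!\left[ \left( \frac{Y^2}{2(\rho_n + \sigma^2)} - \frac{Z^2}{2\sigma^2} \right)^2 \right] \notag \\
&= \frac{1}{4(\rho_n + \sigma^2)^2} \, \mathrm{E}\!\left[ \left( X^2 + 2XZ - \frac{\rho_n}{\sigma^2} Z^2 \right)^2 \right] \notag \\
&\le \frac{1}{4\sigma^4} \, \mathrm{E}\!\left[ \left( X^2 + 2XZ - \frac{\rho_n}{\sigma^2} Z^2 \right)^2 \right]. \tag{89}
\end{align}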

After expanding the square inside the expectation in (89), one 
can verify that the expectation of every summand tends to zero 
as n tends to infinity, establishing (86), and hence (87) and 
(88), proving the achievability part of Theorem 5. 
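The quantities in the achievability argument can be tabulated for concrete parameters. The Python sketch below evaluates the variance choice (79), the total relative entropy on the left-hand side of (80), and the mean in (84). The closed form used for the variance, $\rho_n/(\rho_n + \sigma^2)$, is not stated in the text; it is what one obtains here by expanding the square in (89) with the Gaussian moments $\mathrm{E}[X^4] = 3\rho_n^2$, $\mathrm{E}[Z^4] = 3\sigma^4$, and $\mathrm{E}[X^2 Z^2] = \rho_n \sigma^2$, and should be read as an illustrative derivation. The function and key names are likewise illustrative.

```python
import math


def gaussian_scheme(n: int, delta: float, sigma2: float) -> dict:
    """Evaluate the quantities of the Gaussian random-coding scheme."""
    rho = 2.0 * sigma2 * math.sqrt(delta / n)       # codeword variance, (79)
    x = rho / sigma2
    kl_total = n * 0.5 * (x - math.log1p(x))        # left-hand side of (80)
    mean = 0.5 * math.sqrt(n) * math.log1p(x)       # mean of (83), see (84)
    var = rho / (rho + sigma2)                      # assumed closed form of (89)
    return {"rho": rho, "kl_total": kl_total, "mean": mean, "var": var}


def check_scheme(n: int, delta: float, sigma2: float) -> bool:
    """True if (80) and (84) hold numerically for the given parameters."""
    q = gaussian_scheme(n, delta, sigma2)
    lpd_ok = q["kl_total"] <= delta
    mean_ok = q["mean"] >= math.sqrt(delta) - delta / math.sqrt(n)
    return lpd_ok and mean_ok
```

For example, `gaussian_scheme(10**6, 0.5, 1.0)` gives a total relative entropy slightly below $\delta = 0.5$ and a variance of roughly $1.4 \times 10^{-3}$, which shrinks further as `n` grows.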


VI. Concluding Remarks 

A DMC in practice often represents discretization of a 
continuous-alphabet channel. For example, Figs. 1 and 2 can
result from two different discretizations of the same AWGN 
channel. In this sense, our results suggest that the optimal 
discretization may depend heavily on whether there is an LPD 
requirement or not. 

In practice, LPD communication systems of positive data 
rates often can be implemented even when the channel model 
does not seem to allow positive rates. Indeed, in such applications,
the concern is often not that the transmitted signal
should be sufficiently weak, but rather that it should have
a wide spectrum and resemble white noise [22]. We believe
that one of the reasons why such systems may work is that
realistic channels often have memory. For example, on a 
channel whose noise level varies with a coherence time that 
is longer than the length of a codeword, the transmitter and 
the receiver can use the adversary’s ignorance of the actual 
noise level to communicate without being detected. One way 
to formulate this scenario is to assume that the channel has 
an unknown parameter that is fixed. This is discussed for 
the binary symmetric channel in [23]. Further addressing this 
scenario is part of ongoing research. 